DATA 621 01[46893] : HomeWork1
CUNY_MSDA_ DATA 621_Homework
DATA 621 01[46893] : HomeWork1
1 Overview
In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
We have been given a dataset with 2276 records summarizing a major league baseball team’s season. The records span 1871 to 2006 inclusive. All statistics have been adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
Glossary of data
data.frame(
`Variable Name` = c("INDEX","TARGET_WINS","TEAM_BATTING_H","TEAM_BATTING_2B","TEAM_BATTING_3B","TEAM_BATTING_HR","TEAM_BATTING_BB","TEAM_BATTING_HBP",
"TEAM_BATTING_SO","TEAM_BASERUN_SB","TEAM_BASERUN_CS","TEAM_FIELDING_E","TEAM_FIELDING_DP","TEAM_PITCHING_BB","TEAM_PITCHING_H","TEAM_PITCHING_HR","TEAM_PITCHING_SO"),
`Definition` = c("Identification Variable (do not use)","Number of wins","Base Hits by batters (1B,2B,3B,HR)","Doubles by batters (2B)","Triples by batters (3B)","Homeruns by batters (4B)","Walks by batters","Batters hit by pitch (get a free base)","Strikeouts by batters","Stolen bases","Caught stealing","Errors","Double Plays","Walks allowed","Hits allowed","Homeruns allowed","Strikeouts by pitchers"),
`THEORETICAL EFFECT` = c("None","","Positive Impact on Wins","Positive Impact on Wins","Positive Impact on Wins","Positive Impact on Wins","Positive Impact on Wins","Positive Impact on Wins","Negative Impact on Wins","Positive Impact on Wins","Negative Impact on Wins","Negative Impact on Wins","Positive Impact on Wins","Negative Impact on Wins","Negative Impact on Wins","Negative Impact on Wins","Positive Impact on Wins")
) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
Below is a short description of the variables of interest in the data set:
2 Deliverables
- A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.
- Assigned predictions (the number of wins for the team) for the evaluation data set.
- Include your R statistical programming code in an Appendix.
3 DATA EXPLORATION
The data set describes baseball team statistics for the years 1871 to 2006 inclusive. Each record in the data set represents the performance of the team for the given year adjusted to the current length of the season - 162 games. The data set includes 16 variables and the training set includes 2,276 records.
Load the data and understand the data by using some stats and plots.
mtd <- read.csv("https://raw.githubusercontent.com/Rajwantmishra/DATA621_CR4/master/HW1/Deb/moneyball-training-data.csv")
med <- read.csv("https://raw.githubusercontent.com/Rajwantmishra/DATA621_CR4/master/HW1/Deb/moneyball-evaluation-data.csv")
View rows and columns, variable types
Glimpse of the data
glimpse(mtd)
## Observations: 2,276
## Variables: 17
## $ INDEX <int> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18...
## $ TARGET_WINS <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72...
## $ TEAM_BATTING_H <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 13...
## $ TEAM_BATTING_2B <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, ...
## $ TEAM_BATTING_3B <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41...
## $ TEAM_BATTING_HR <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, ...
## $ TEAM_BATTING_BB <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, ...
## $ TEAM_BATTING_SO <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 82...
## $ TEAM_BASERUN_SB <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, ...
## $ TEAM_BASERUN_CS <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 10...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ TEAM_PITCHING_H <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 13...
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, ...
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, ...
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 8...
## $ TEAM_FIELDING_E <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131,...
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 1...
Sample 6 rows with sample 7 columns
head(mtd)
Show entire dataset
DT::datatable(mtd, options = list(pagelength=5))
#DT::datatable(med, options = list(pagelength=5))
Structure of data
paste("Dimension of dataset", dim(mtd))
## [1] "Dimension of dataset 2276" "Dimension of dataset 17"
paste("Count of dataset", count(mtd))
## [1] "Count of dataset 2276"
summary(mtd)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
describe(mtd)
## mtd
##
## 17 Variables 2276 Observations
## --------------------------------------------------------------------------------
## INDEX
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 2276 1 1268 850.4 125.8 252.5
## .25 .50 .75 .90 .95
## 630.8 1270.5 1915.5 2287.5 2407.2
##
## lowest : 1 2 3 4 5, highest: 2531 2532 2533 2534 2535
## --------------------------------------------------------------------------------
## TARGET_WINS
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 108 1 80.79 17.47 54.0 61.0
## .25 .50 .75 .90 .95
## 71.0 82.0 92.0 99.5 104.0
##
## lowest : 0 12 14 17 21, highest: 128 129 134 135 146
## --------------------------------------------------------------------------------
## TEAM_BATTING_H
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 569 1 1469 149.8 1282 1315
## .25 .50 .75 .90 .95
## 1383 1454 1537 1636 1695
##
## lowest : 891 992 1009 1116 1122, highest: 2333 2343 2372 2496 2554
## --------------------------------------------------------------------------------
## TEAM_BATTING_2B
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 240 1 241.2 52.89 167 182
## .25 .50 .75 .90 .95
## 208 238 273 303 320
##
## lowest : 69 112 113 118 123, highest: 382 392 393 403 458
## --------------------------------------------------------------------------------
## TEAM_BATTING_3B
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 144 1 55.25 30.34 23 27
## .25 .50 .75 .90 .95
## 34 47 72 96 108
##
## lowest : 0 8 9 11 12, highest: 166 190 197 200 223
## --------------------------------------------------------------------------------
## TEAM_BATTING_HR
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 243 1 99.61 69.49 14.0 20.0
## .25 .50 .75 .90 .95
## 42.0 102.0 147.0 179.5 199.0
##
## lowest : 0 3 4 5 6, highest: 247 249 257 260 264
## --------------------------------------------------------------------------------
## TEAM_BATTING_BB
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 533 1 501.6 130.1 248.2 363.5
## .25 .50 .75 .90 .95
## 451.0 512.0 580.0 635.0 670.2
##
## lowest : 0 12 29 34 45, highest: 815 819 824 860 878
## --------------------------------------------------------------------------------
## TEAM_BATTING_SO
## n missing distinct Info Mean Gmd .05 .10
## 2174 102 822 1 735.6 282.2 359 421
## .25 .50 .75 .90 .95
## 548 750 930 1049 1103
##
## lowest : 0 66 67 72 74, highest: 1303 1320 1326 1335 1399
## --------------------------------------------------------------------------------
## TEAM_BASERUN_SB
## n missing distinct Info Mean Gmd .05 .10
## 2145 131 348 1 124.8 87.96 35.0 44.0
## .25 .50 .75 .90 .95
## 66.0 101.0 156.0 231.0 301.8
##
## lowest : 0 14 18 19 20, highest: 562 567 632 654 697
## --------------------------------------------------------------------------------
## TEAM_BASERUN_CS
## n missing distinct Info Mean Gmd .05 .10
## 1504 772 128 1 52.8 23.24 24 30
## .25 .50 .75 .90 .95
## 38 49 62 77 91
##
## lowest : 0 7 11 12 14, highest: 171 186 193 200 201
## --------------------------------------------------------------------------------
## TEAM_BATTING_HBP
## n missing distinct Info Mean Gmd .05 .10
## 191 2085 55 0.999 59.36 14.61 40.0 44.0
## .25 .50 .75 .90 .95
## 50.5 58.0 67.0 76.0 82.5
##
## lowest : 29 30 35 38 39, highest: 87 88 89 90 95
## --------------------------------------------------------------------------------
## TEAM_PITCHING_H
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 843 1 1779 628.1 1316 1356
## .25 .50 .75 .90 .95
## 1419 1518 1682 2058 2563
##
## lowest : 1137 1168 1184 1187 1202, highest: 16038 16871 20088 24057 30132
## --------------------------------------------------------------------------------
## TEAM_PITCHING_HR
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 256 1 105.7 70.02 18.0 25.0
## .25 .50 .75 .90 .95
## 50.0 107.0 150.0 187.0 209.2
##
## lowest : 0 3 4 5 6, highest: 291 297 301 320 343
## --------------------------------------------------------------------------------
## TEAM_PITCHING_BB
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 535 1 553 140.7 377.0 417.5
## .25 .50 .75 .90 .95
## 476.0 536.5 611.0 693.5 757.0
##
## lowest : 0 119 124 131 140, highest: 2169 2396 2840 2876 3645
## --------------------------------------------------------------------------------
## TEAM_PITCHING_SO
## n missing distinct Info Mean Gmd .05 .10
## 2174 102 823 1 817.7 316.9 421.3 490.0
## .25 .50 .75 .90 .95
## 615.0 813.5 968.0 1095.0 1173.0
##
## lowest : 0 181 205 208 252, highest: 3450 4224 5456 12758 19278
##
## Value 0 200 400 600 800 1000 1200 1400 1600 1800 2200
## Frequency 20 7 211 554 593 580 156 35 7 2 1
## Proportion 0.009 0.003 0.097 0.255 0.273 0.267 0.072 0.016 0.003 0.001 0.000
##
## Value 2400 3400 4200 5400 12800 19200
## Frequency 3 1 1 1 1 1
## Proportion 0.001 0.000 0.000 0.000 0.000 0.000
##
## For the frequency table, variable is rounded to the nearest 200
## --------------------------------------------------------------------------------
## TEAM_FIELDING_E
## n missing distinct Info Mean Gmd .05 .10
## 2276 0 549 1 246.5 190.4 100.0 109.0
## .25 .50 .75 .90 .95
## 127.0 159.0 249.2 542.0 716.0
##
## lowest : 65 66 68 72 74, highest: 1567 1728 1740 1890 1898
## --------------------------------------------------------------------------------
## TEAM_FIELDING_DP
## n missing distinct Info Mean Gmd .05 .10
## 1990 286 144 1 146.4 29.29 98 109
## .25 .50 .75 .90 .95
## 131 149 164 178 186
##
## lowest : 52 64 68 71 72, highest: 215 218 219 225 228
## --------------------------------------------------------------------------------
names(mtd)
## [1] "INDEX" "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [5] "TEAM_BATTING_3B" "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO"
## [9] "TEAM_BASERUN_SB" "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H"
## [13] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E"
## [17] "TEAM_FIELDING_DP"
str(mtd)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
mtd %>%
summary() %>%
kable() %>%
kable_styling()
| INDEX | TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 | |
| 1st Qu.: 630.8 | 1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 548.0 | 1st Qu.: 66.0 | 1st Qu.: 38.0 | 1st Qu.:50.50 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 615.0 | 1st Qu.: 127.0 | 1st Qu.:131.0 | |
| Median :1270.5 | Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 750.0 | Median :101.0 | Median : 49.0 | Median :58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 813.5 | Median : 159.0 | Median :149.0 | |
| Mean :1268.5 | Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.6 | Mean :124.8 | Mean : 52.8 | Mean :59.36 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 817.7 | Mean : 246.5 | Mean :146.4 | |
| 3rd Qu.:1915.5 | 3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 930.0 | 3rd Qu.:156.0 | 3rd Qu.: 62.0 | 3rd Qu.:67.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 968.0 | 3rd Qu.: 249.2 | 3rd Qu.:164.0 | |
| Max. :2535.0 | Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 | |
| NA | NA | NA | NA | NA | NA | NA | NA’s :102 | NA’s :131 | NA’s :772 | NA’s :2085 | NA | NA | NA | NA’s :102 | NA | NA’s :286 |
train <- mtd
test <- med
train$INDEX <- NULL
test$INDEX <- NULL
cleanNames <- function(train) {
name_list <- names(train)
name_list <- gsub("TEAM_", "", name_list)
names(train) <- name_list
train
}
mtd <- cleanNames(train)
med <- cleanNames(test)
Visualize the data
mtd %>%
gather(variable, value, TARGET_WINS:FIELDING_DP) %>%
ggplot(., aes(value)) +
geom_density(fill = "indianred4", color="indianred4") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
In the histogram plot above, we see that the batting, pitching home-run and batting strike-out variables are bi modal. TARGET_WINS and TEAM_BATTING_2B has most the normal distribution. PITCHING_H and PITCHING_SO have the most skewed data distribution. The skewed graphs are all rght-skewed except BATTING_BB.
scatterplot3d(mtd$TARGET_WINS, mtd$BATTING_2B, mtd$BATTING_BB, pch = 20, highlight.3d = TRUE, type = "h", main = "3D ScatterPlots")
The above 3-D scatter plot, shows the data variance between the TARGET_WINS, TEAM_BATTING_2B and TEAM_BATTING_BB to provide a comparative view.
par(mfrow=c(3,2))
for (i in 1:16) {
hist(mtd[,i],main=names(mtd[i]),xlab=names(mtd[i]),breaks = 51)
boxplot(mtd[,i], main=names(mtd[i]), type="l",horizontal = TRUE)
plot(mtd[,i], mtd$TARGET_WINS, main = names(mtd[i]), xlab=names(mtd[i]))
abline(lm(mtd$TARGET_WINS ~ mtd[,i], data = mtd), col = "blue")
}
As can be seen from above histogram, boxplot and scatter plot with regression line shows the spread of the data points. More than half of the variables show skewness. A box-cox transformation may help to mitigate the skewness.
Missing or NA Values
We are trying to see how many NA is present in the dataset.
mtd %>%
gather(variable, value) %>%
filter(is.na(value)) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(mtd) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
# rename(`Variable Missing Data`=variable,`Number of Records`=n,`Share of Total`=percent) %>%
kable() %>%
kable_styling()
| variable | n | percent |
|---|---|---|
| BATTING_HBP | 2085 | 92% |
| BASERUN_CS | 772 | 34% |
| FIELDING_DP | 286 | 13% |
| BASERUN_SB | 131 | 5.8% |
| BATTING_SO | 102 | 4.5% |
| PITCHING_SO | 102 | 4.5% |
The variable BATTING_HBP (hit by pitcher) is missing over 90% of it’s data.
Zero Values
mtd %>%
gather(variable, value) %>%
filter(value == 0) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(mtd) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
# rename("Variable With Zeros"=variable,"Number of Records"=n,"Share of Total"=percent) %>%
kable() %>%
kable_styling()
| variable | n | percent |
|---|---|---|
| BATTING_SO | 20 | 0.9% |
| PITCHING_SO | 20 | 0.9% |
| BATTING_HR | 15 | 0.7% |
| PITCHING_HR | 15 | 0.7% |
| BASERUN_SB | 2 | 0.1% |
| BATTING_3B | 2 | 0.1% |
| BASERUN_CS | 1 | 0% |
| BATTING_BB | 1 | 0% |
| PITCHING_BB | 1 | 0% |
| TARGET_WINS | 1 | 0% |
As can be inferred from above, there are Very few zero values exists.
Checking for outliers
ggplot(stack(mtd), aes(x = ind, y = values)) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 2500)) +
theme(legend.position="none") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(panel.background = element_rect(fill = 'grey'))
The box plots reveal that a great majority of the explanatory variables have high variances. Many of the medians and means are also not aligned which demonstrates the outliers’ effects.
The variance of some of the explanatory variables greatly exceeds the variance of the response “win” variable. The dataset has many outlines with some observations that are more extreme than the 1.5 * IQR of the box plot whiskers.
Checking for skewness in the data
melt(mtd) %>%
ggplot(aes(x= value)) +
geom_density(fill='red') + facet_wrap(~variable, scales = 'free')
As per above, there are several variables like PITCHING_H, PITCHING_BB, PITCHING_SO and FIELDING_E are extremely skewed as there are many outliers.
Finding correlations: Below shows the comparative correlations between the 16 variables as it shows the correlation coefficients and thus find correlated variables. Whichever adhere to a fitted straight red line well, ie. change in synch with each other. If the points lie close to the line but the line is curved, it’s good nonlinear association and one can still be defined by other. Each individual plot shows the relationship between the variable in the horizontal vs the vertical of the grid. Each individual plot shows the relationship between the variable in the horizontal vs the vertical of the grid, whereas the diagonal is showing a histogram of each variable.
DT::datatable(cor(drop_na(mtd[,])), options = list(pagelength=5))
pairs.panels((mtd[,])[1:8])
As can be seen from above, TARGET_WINS vs BATTING_2B is continuous and hence correlated and so is BATTING_BB and BATTING_HR.
pairs.panels((mtd[,])[9:16])
As can be seen from above, BASERUN_CS vs BATTING_HBP is continuous and hence correlated whereas PITCHING_SO and FIELDING_E is not correlated at all.
cor_res <- cor(mtd, use = "complete.obs")
mtd %>%
cor(., use = "complete.obs") %>%
corrplot(., method = "color", type = "upper", tl.col = "black", diag = FALSE)
Also, there are some negatively correlated variables. According to the correlation heatmap, the values that correspond most positively are BATTING_H, BATTING_2B, BATTING_HR, BATTING_BB, PITCHING_H, PITCHING_HR, and PITCHING_BB.
mtd %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "indianred4", color="indianred4") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")
Above shows how the data is distributed when compared to the linear regression. Clearly, PITCHING_H and PITCHING_SO are highly heteroscedastic. Comparatively, BATTING_HBP is most homoscedastic.
cor_res[,1:2]
## TARGET_WINS BATTING_H
## TARGET_WINS 1.00000000 0.46994665
## BATTING_H 0.46994665 1.00000000
## BATTING_2B 0.31298400 0.56177286
## BATTING_3B -0.12434586 0.21391883
## BATTING_HR 0.42241683 0.39627593
## BATTING_BB 0.46868793 0.19735234
## BATTING_SO -0.22889273 -0.34174328
## BASERUN_SB 0.01483639 0.07167495
## BASERUN_CS -0.17875598 -0.09377545
## BATTING_HBP 0.07350424 -0.02911218
## PITCHING_H 0.47123431 0.99919269
## PITCHING_HR 0.42246683 0.39495630
## PITCHING_BB 0.46839882 0.19529071
## PITCHING_SO -0.22936481 -0.34445001
## FIELDING_E -0.38668800 -0.25381638
## FIELDING_DP -0.19586601 0.01776946
Above shows the correlation coefficient of each variable compared to TARGET_WINS and BATTING_H.
Histogram of Variables
hist.data.frame(mtd)
par(mfrow=c(2,3))
plot(TARGET_WINS ~ BATTING_H,mtd)
abline(lm(TARGET_WINS ~ BATTING_H,data = mtd),col="blue")
plot(TARGET_WINS ~ BATTING_2B,mtd)
abline(lm(TARGET_WINS ~ BATTING_2B,data = mtd),col="blue")
plot(TARGET_WINS ~ BATTING_3B,mtd)
abline(lm(TARGET_WINS ~ BATTING_3B,data = mtd),col="blue")
plot(TARGET_WINS ~ BATTING_HR,mtd)
abline(lm(TARGET_WINS ~ BATTING_HR,data = mtd),col="blue")
plot(TARGET_WINS ~ BATTING_BB,mtd)
abline(lm(TARGET_WINS ~ BATTING_BB,data = mtd),col="blue")
plot(TARGET_WINS ~ BATTING_SO,mtd)
abline(lm(TARGET_WINS ~ BATTING_SO,data = mtd),col="blue")
plot(TARGET_WINS ~ BASERUN_SB,mtd)
abline(lm(TARGET_WINS ~ BASERUN_SB,data = mtd),col="blue")
plot(TARGET_WINS ~ BASERUN_CS,mtd)
abline(lm(TARGET_WINS ~ BASERUN_CS,data = mtd),col="blue")
plot(TARGET_WINS ~ PITCHING_H,mtd)
abline(lm(TARGET_WINS ~ PITCHING_H,data = mtd),col="blue")
plot(TARGET_WINS ~ PITCHING_HR,mtd)
abline(lm(TARGET_WINS ~ PITCHING_HR,data = mtd),col="blue")
plot(TARGET_WINS ~ PITCHING_BB,mtd)
abline(lm(TARGET_WINS ~ PITCHING_BB,data = mtd),col="blue")
plot(TARGET_WINS ~ PITCHING_SO,mtd)
abline(lm(TARGET_WINS ~ PITCHING_SO,data = mtd),col="blue")
plot(TARGET_WINS ~ FIELDING_E,mtd)
abline(lm(TARGET_WINS ~ FIELDING_E,data = mtd),col="blue")
plot(TARGET_WINS ~ FIELDING_DP,mtd)
abline(lm(TARGET_WINS ~ FIELDING_DP,data = mtd),col="blue")
This shows very few variables are normally distributed.
3.0.1 Missing value by Graph
Here will see how much of data is missing in each predictors.
vis_miss(mtd)
Here from the plots we can see outliers in PITCHING_H,PITCHING_BB and PITCHING_SO
Also, since BATTING_H is a combination of BATTING_2B, BATTING_3B, BATTING_HR (and also includes batted singles), we will create a new variable BATTING_1B equaling BATTING_H - BATTING_2B - BATTING_3B - BATTING_HR and after creating this we will remove BATTING_H
Initial Observations
- Response variable (TARGET_WINS) looks to be normally distributed which means there are good teams, bad teams as well as average teams.
- There are also quite a few variables with missing values. We may need to deal with these in order to have the largest data set possible for modeling.
- A couple variables are bimodal (TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_PITCHING_HR). This may be a challenge as some of them are missing values and that may be a challenge in filling in missing values.
- Some variables are right skewed (TEAM_BASERUN_CS, TEAM_BASERUN_SB, etc.). This might support the good team theory. It may also introduce non-normally distributed residuals in the model. We shall see.
- Dataset covers a wide time period spanning across multiple “eras” of baseball.
4 DATA PREPARATION
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations. a. Fix missing values (maybe with a Mean or Median value) b. Create flags to suggest if a variable was missing c. Transform data by putting it into buckets d. Mathematical transforms such as log or square root (or use Box-Cox) e. Combine variables (such as ratios or adding or multiplying) to create new variables
Fixing Missing/Zero Values - Remove the invalid data and prep it for imputation. - We could “discard” the TEAM_BATTING_HBP,due to the high percentage of missing data; particularly, replacing it by “ZERO” should not be advisable since the minimum value recorded is 29 and replacing it with a median value would not be much helpful due to high percentage of missing values. We decided not to consider this variable for our study. - A typical professional league baseball game has 9 innings (extra innings come to play in the event of a tie) in length, and in each inning one can only pitch 3 strikeouts. There have been a maximum of 27 potential strikeouts upto a maximum of by 162 games for each of the 30 teams in the American League (AL) and National League (NL), played over approximately six months in Major League Baseball (MLB) season. Therefore having more than 4374 strikeouts (9x3x162) is not possible. Incidentally, the maximum strikeouts in any baseball season has been 513 by Matt Kilroy in the year 1886 as part of Baltimore Orioles within American Association League,
remove_bad_values <- function(df){
# Change 0's to NA so they too can be imputed
df <- df %>% mutate(BATTING_SO = ifelse(BATTING_SO == 0, NA, BATTING_SO))
# Remove the high pitching strikeout values
df[which(df$PITCHING_SO > 4374),"PITCHING_SO"] <- NA
# Drop the hit by pitcher variable
df %>% select(-BATTING_HBP)
}
mtd <- remove_bad_values(mtd)
med <- remove_bad_values(med) %>% na.omit()
Imputing the values using KNN
set.seed(42)
knn <- mtd %>% DMwR::knnImputation()
impute_values <- function(df, knn){
impute_me <- is.na(df$BATTING_SO)
df[impute_me,"BATTING_SO"] <- knn[impute_me,"BATTING_SO"]
impute_me <- is.na(df$BASERUN_SB)
df[impute_me,"BASERUN_SB"] <- knn[impute_me,"BASERUN_SB"]
impute_me <- is.na(df$BASERUN_CS)
df[impute_me,"BASERUN_CS"] <- knn[impute_me,"BASERUN_CS"]
impute_me <- is.na(df$PITCHING_SO)
df[impute_me,"PITCHING_SO"] <- knn[impute_me,"PITCHING_SO"]
impute_me <- is.na(df$FIELDING_DP)
df[impute_me,"FIELDING_DP"] <- knn[impute_me,"FIELDING_DP"]
return(df)
}
imputed_mtd_Data <- impute_values(mtd, knn)
# Including batting singles
add_features <- function(df){
df %>%
mutate(BATTING_1B = BATTING_H - BATTING_2B - BATTING_3B - BATTING_HR)
}
mtd <- add_features(mtd)
med <- add_features(med)
5 BUILD MODELS
Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.
Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
set.seed(42)
train_index <- createDataPartition(mtd$TARGET_WINS, p = .7, list = FALSE, times = 1)
moneyball_train <- mtd[train_index,]
moneyball_test <- mtd[-train_index,]
5.1 Model 1 : Kitchen Sink Model/Backward Elimination
With all variables to determine the base model provided. This would allow to see which variables are significant in our dataset, and allows to make other models based on that.
# Result to hold all the main info about model
result<- data.frame("ModelName"=NA,"Variables"=NA,"Removed"=NA,"Adjusted R2"=NA,"P-Value" =NA, "AIC"= NA , "Note"= NA)
model1 <- lm(TARGET_WINS ~., data=moneyball_train)
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.0724 -6.5828 -0.1407 6.4786 28.3847
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.53113 7.79100 7.513 1.25e-13 ***
## BATTING_H 0.01653 0.02346 0.704 0.481330
## BATTING_2B -0.07540 0.01100 -6.854 1.23e-11 ***
## BATTING_3B 0.17325 0.02552 6.789 1.90e-11 ***
## BATTING_HR 0.13176 0.09460 1.393 0.163944
## BATTING_BB 0.02796 0.05440 0.514 0.607397
## BATTING_SO 0.01254 0.02769 0.453 0.650670
## BASERUN_SB 0.03694 0.01026 3.600 0.000334 ***
## BASERUN_CS 0.05115 0.02196 2.329 0.020032 *
## PITCHING_H 0.01747 0.02210 0.791 0.429325
## PITCHING_HR -0.02926 0.09070 -0.323 0.747075
## PITCHING_BB 0.01110 0.05237 0.212 0.832216
## PITCHING_SO -0.03241 0.02645 -1.225 0.220789
## FIELDING_E -0.16207 0.01230 -13.176 < 2e-16 ***
## FIELDING_DP -0.10625 0.01545 -6.875 1.07e-11 ***
## BATTING_1B NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.469 on 1037 degrees of freedom
## (543 observations deleted due to missingness)
## Multiple R-squared: 0.4421, Adjusted R-squared: 0.4346
## F-statistic: 58.7 on 14 and 1037 DF, p-value: < 2.2e-16
# Storing data for future ref
result <- rbind(result,
c("ModelName" = "model1",
"Variables" = paste0(formula(model1)[3]),
"Removed"= NA,
"Adjusted R2" = round(summary(model1)$adj.r.squared,4),
"P-Value" = glance(model1)$p.value,
"AIC" = glance(model1)$AIC,
"Note"= "BATTING_2B,BATTING_3B,BASERUN_SB ,BASERUN_CS,FIELDING_E,FIELDING_DP"))
It does a fairly good job predicting, but there are a lot of variables that are not statistically significant. We see the that P-value is less than .05 which makes it one of the possible model but not all the coefficients of the model1 are significant.
5.2 Model 2 : Simple Model
With only the significant variables: Pick variables that had high correlations and include the pitching variables
model2 <- lm(TARGET_WINS ~ BATTING_H + BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO +
BASERUN_SB + PITCHING_SO + PITCHING_H + PITCHING_SO +
FIELDING_E + FIELDING_DP, data=moneyball_train)
summary(model2)
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_3B + BATTING_HR +
## BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_SO + PITCHING_H +
## PITCHING_SO + FIELDING_E + FIELDING_DP, data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.633 -7.407 0.103 7.218 29.771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.346701 6.624503 11.072 < 2e-16 ***
## BATTING_H -0.036127 0.012857 -2.810 0.005032 **
## BATTING_3B 0.201222 0.022342 9.007 < 2e-16 ***
## BATTING_HR 0.114499 0.010869 10.535 < 2e-16 ***
## BATTING_BB 0.032347 0.003796 8.522 < 2e-16 ***
## BATTING_SO 0.048172 0.020693 2.328 0.020072 *
## BASERUN_SB 0.074635 0.006672 11.186 < 2e-16 ***
## PITCHING_SO -0.071270 0.019581 -3.640 0.000284 ***
## PITCHING_H 0.043819 0.011707 3.743 0.000190 ***
## FIELDING_E -0.111738 0.008436 -13.245 < 2e-16 ***
## FIELDING_DP -0.105429 0.014630 -7.206 9.77e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.29 on 1286 degrees of freedom
## (298 observations deleted due to missingness)
## Multiple R-squared: 0.3949, Adjusted R-squared: 0.3902
## F-statistic: 83.92 on 10 and 1286 DF, p-value: < 2.2e-16
# Storing data for future ref
result <- rbind(result,
c("ModelName" = "model2",
"Variables" = paste0(formula(model2)[3]),
"Removed"= NA,
"Adjusted R2" = round(summary(model2)$adj.r.squared,4),
"P-Value" = glance(model2)$p.value,
"AIC" = glance(model2)$AIC,
"Note"= "All are significant"))
5.3 Model 3 : Higher Order Stepwise Regression
Only taking the variable from the Model1 that are really significant.
model3a <- lm(TARGET_WINS~BATTING_2B+BATTING_3B+BASERUN_SB+BASERUN_CS+FIELDING_E+FIELDING_DP, data=moneyball_train)
summary(model3a)
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_2B + BATTING_3B + BASERUN_SB +
## BASERUN_CS + FIELDING_E + FIELDING_DP, data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.0056 -7.9628 -0.3434 8.0241 30.3356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.226932 4.171175 22.350 <2e-16 ***
## BATTING_2B 0.019018 0.008810 2.159 0.0311 *
## BATTING_3B 0.273238 0.025450 10.736 <2e-16 ***
## BASERUN_SB 0.018523 0.011820 1.567 0.1174
## BASERUN_CS 0.007483 0.025892 0.289 0.7726
## FIELDING_E -0.169187 0.013894 -12.177 <2e-16 ***
## FIELDING_DP -0.043599 0.018145 -2.403 0.0164 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.44 on 1045 degrees of freedom
## (543 observations deleted due to missingness)
## Multiple R-squared: 0.1794, Adjusted R-squared: 0.1747
## F-statistic: 38.08 on 6 and 1045 DF, p-value: < 2.2e-16
# Storing data for future ref
result <- rbind(result,
c("ModelName" = "model3a",
"Variables" = paste0(formula(model3a)[3]),
"Removed"= NA,
"Adjusted R2" = round(summary(model3a)$adj.r.squared,4),
"P-Value" = glance(model3a)$p.value,
"AIC" = glance(model3a)$AIC,
"Note"= "BATTING_3B,FIELDING_E ,BATTING_2B,FIELDING_DP are significant"))
model3b <- lm(TARGET_WINS~BATTING_3B + FIELDING_E + BATTING_2B + FIELDING_DP, data=moneyball_train)
summary(model3b)
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_3B + FIELDING_E + BATTING_2B +
## FIELDING_DP, data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.154 -9.095 0.359 8.972 47.276
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.11824 3.17547 23.026 < 2e-16 ***
## BATTING_3B 0.15080 0.01793 8.411 < 2e-16 ***
## FIELDING_E -0.02936 0.00371 -7.913 5.08e-15 ***
## BATTING_2B 0.06870 0.00816 8.418 < 2e-16 ***
## FIELDING_DP -0.07547 0.01579 -4.780 1.94e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.17 on 1396 degrees of freedom
## (194 observations deleted due to missingness)
## Multiple R-squared: 0.1159, Adjusted R-squared: 0.1134
## F-statistic: 45.75 on 4 and 1396 DF, p-value: < 2.2e-16
result <- rbind(result,
c("ModelName" = "model3b",
"Variables" = paste0(formula(model3b)[3]),
"Removed"= NA,
"Adjusted R2" = round(summary(model3b)$adj.r.squared,4),
"P-Value" = glance(model3b)$p.value,
"AIC" = glance(model3b)$AIC,
"Note"= "All are significant"))
Further reducing the variables(TEAM_PITCHING_SO and TEAM_BATTING_SO are having high correlation, TEAM_BATTING_H and TEAM_PITCHING_H are also having high correlation, TEAM_BATTING_SO and TEAM_PITCHING_SO are also having high correlation):
model3 <- lm(TARGET_WINS ~ BATTING_1B + BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB + BATTING_SO +
BASERUN_SB + BASERUN_CS +
PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO +
FIELDING_E + FIELDING_DP, data=moneyball_train)
#+I(BATTING_1B^2) + I(BATTING_2B^2) + I(BATTING_3B^2) + I(BATTING_HR^2) + I(BATTING_BB^2) + I(BATTING_SO^2) +
#+I(BASERUN_SB^2) + I(BASERUN_CS^2) +
#+I(PITCHING_H^2) + I(PITCHING_HR^2) + I(PITCHING_BB^2) + I(PITCHING_SO^2) +
#+I(FIELDING_E^2) + I(FIELDING_DP^2) +
#+I(BATTING_2B^3) + I(BATTING_3B^3) + I(BATTING_HR^3) + I(BATTING_BB^3) + I(BATTING_SO^3) +
#+I(BASERUN_SB^3) + I(BASERUN_CS^3) +
#+I(PITCHING_H^3) + I(PITCHING_HR^3) + I(PITCHING_BB^3) + I(PITCHING_SO^3) +
#+I(FIELDING_E^3) + I(FIELDING_DP^3) +
#+I(BATTING_1B^3) + I(BATTING_2B^4) + I(BATTING_3B^4) + I(BATTING_HR^4) + I(BATTING_BB^4) + I(BATTING_SO^4) +
#+I(BASERUN_SB^4) + I(BASERUN_CS^4) +
#+I(PITCHING_H^4) + I(PITCHING_HR^4) + I(PITCHING_BB^4) + I(PITCHING_SO^4) +
#+I(FIELDING_E^4) + I(FIELDING_DP^4) + I(BATTING_1B^4)
summary(model3)
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_1B + BATTING_2B + BATTING_3B +
## BATTING_HR + BATTING_BB + BATTING_SO + BASERUN_SB + BASERUN_CS +
## PITCHING_H + PITCHING_HR + PITCHING_BB + PITCHING_SO + FIELDING_E +
## FIELDING_DP, data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.0724 -6.5828 -0.1407 6.4786 28.3847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.53113 7.79100 7.513 1.25e-13 ***
## BATTING_1B 0.01653 0.02346 0.704 0.481330
## BATTING_2B -0.05888 0.02461 -2.392 0.016923 *
## BATTING_3B 0.18978 0.03303 5.746 1.20e-08 ***
## BATTING_HR 0.14829 0.10060 1.474 0.140776
## BATTING_BB 0.02796 0.05440 0.514 0.607397
## BATTING_SO 0.01254 0.02769 0.453 0.650670
## BASERUN_SB 0.03694 0.01026 3.600 0.000334 ***
## BASERUN_CS 0.05115 0.02196 2.329 0.020032 *
## PITCHING_H 0.01747 0.02210 0.791 0.429325
## PITCHING_HR -0.02926 0.09070 -0.323 0.747075
## PITCHING_BB 0.01110 0.05237 0.212 0.832216
## PITCHING_SO -0.03241 0.02645 -1.225 0.220789
## FIELDING_E -0.16207 0.01230 -13.176 < 2e-16 ***
## FIELDING_DP -0.10625 0.01545 -6.875 1.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.469 on 1037 degrees of freedom
## (543 observations deleted due to missingness)
## Multiple R-squared: 0.4421, Adjusted R-squared: 0.4346
## F-statistic: 58.7 on 14 and 1037 DF, p-value: < 2.2e-16
result <- rbind(result,
c("ModelName" = "model3",
"Variables" = paste0(formula(model3)[3]),
"Removed"= NA,
"Adjusted R2" = round(summary(model3)$adj.r.squared,4),
"P-Value" = glance(model3)$p.value,
"AIC" = glance(model3)$AIC,
"Note"= "Nothing is significant"))
# StepBack Model
step_back <- MASS::stepAIC(model3, direction="backward", trace = F)
poly_call <- summary(step_back)$call
step_back <- lm(poly_call[2], moneyball_train)
summary(step_back)
##
## Call:
## lm(formula = poly_call[2], data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.0741 -6.5189 -0.0304 6.5548 28.5287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.226582 7.718003 7.674 3.83e-14 ***
## BATTING_1B 0.021961 0.006883 3.191 0.001462 **
## BATTING_2B -0.052339 0.008634 -6.062 1.88e-09 ***
## BATTING_3B 0.195353 0.024739 7.897 7.25e-15 ***
## BATTING_HR 0.123437 0.009440 13.077 < 2e-16 ***
## BATTING_BB 0.039462 0.003927 10.048 < 2e-16 ***
## BASERUN_SB 0.036916 0.010210 3.616 0.000314 ***
## BASERUN_CS 0.051264 0.021908 2.340 0.019475 *
## PITCHING_H 0.011846 0.002851 4.155 3.52e-05 ***
## PITCHING_SO -0.020636 0.002747 -7.513 1.25e-13 ***
## FIELDING_E -0.162363 0.012228 -13.278 < 2e-16 ***
## FIELDING_DP -0.106435 0.015427 -6.899 9.07e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.458 on 1040 degrees of freedom
## (543 observations deleted due to missingness)
## Multiple R-squared: 0.4418, Adjusted R-squared: 0.4359
## F-statistic: 74.83 on 11 and 1040 DF, p-value: < 2.2e-16
result <- rbind(result,
c("ModelName" = "step_back",
"Variables" = paste0(formula(step_back)[3]),
"Removed"= NA,
"Adjusted R2" = round(summary(step_back)$adj.r.squared,4),
"P-Value" = glance(step_back)$p.value,
"AIC" = glance(step_back)$AIC,
"Note"= "more vars significant"))
6 SELECT MODELS
Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots.
Make predictions using the evaluation data set.
Lets review the result for each our our model:
datatable(result[,-c(2,3)])
6.0.1 Multicolinearity
Lets Evaluate if we have any multicolinearity in our model1s.Multicollinearity (also collinearity) is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.
We will user alias function to detect the collinearity of all the predictor in the model1.
6.0.1.1 Model 1
alias(model1)
## Model :
## TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR +
## BATTING_BB + BATTING_SO + BASERUN_SB + BASERUN_CS + PITCHING_H +
## PITCHING_HR + PITCHING_BB + PITCHING_SO + FIELDING_E + FIELDING_DP +
## BATTING_1B
##
## Complete :
## (Intercept) BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## BATTING_1B 0 1 -1 -1 -1 0
## BATTING_SO BASERUN_SB BASERUN_CS PITCHING_H PITCHING_HR PITCHING_BB
## BATTING_1B 0 0 0 0 0 0
## PITCHING_SO FIELDING_E FIELDING_DP
## BATTING_1B 0 0 0
# vif(lm(TARGET_WINS~.,moneyball_train[,-c(2,3,4,5)]))
corrplot(cor(mtd),type = 'upper')
# dput(model1)
Result shows that BATTING_1B is corealted with BATTING_H , BATTING_2B BATTING_3B , BATTING_HR . Here +1 and -1 are indicative of sign of coefecifint of the repstive predictor while stating the value for BATTING_1B.
Corrplot also suggest the same except , it doen’t show high correlation between BATTING_H``BATTING_HR. In our Model2 , we well just follow the p-value significance test and build the model.
# Make predictions
predictions <- model1 %>% predict(moneyball_test)
# Model performance
data.frame(
RMSE = RMSE(predictions, moneyball_test$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions,moneyball_test$TARGET_WINS,na.rm = TRUE)
)
6.0.2 Model 2
Here alias doen’t suggest any correlated predictor. Now we can run VIF (variance inflation factor), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. The smallest possible value of VIF is one (absence of multicollinearity). Here we will look for VIF value, if that exceeds 5 or 10 indicates a problematic amount of collinearity. “Read More”[‘http://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/’]
alias(model2)
## Model :
## TARGET_WINS ~ BATTING_H + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_SO + PITCHING_H + PITCHING_SO +
## FIELDING_E + FIELDING_DP
vif(model2)
## BATTING_H BATTING_3B BATTING_HR BATTING_BB BATTING_SO BASERUN_SB
## 23.591594 2.924829 4.274146 1.259010 242.802006 1.539592
## PITCHING_SO PITCHING_H FIELDING_E FIELDING_DP
## 225.307718 48.406757 2.835717 1.353810
VIF output suggest that BATTING_H, PITCHING_H, BATTING_SO,PITCHING_SO are highly impacting model due their colinear relation.
# Make predictions
predictions <- model2 %>% predict(moneyball_test)
# Model performance
data.frame(
RMSE = RMSE(predictions, moneyball_test$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions,moneyball_test$TARGET_WINS,na.rm = TRUE)
)
6.0.2.1 Model 3
# Make predictions
predictions <- model3 %>% predict(moneyball_test)
# Model performance
data.frame(
RMSE = RMSE(predictions, moneyball_test$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions,moneyball_test$TARGET_WINS,na.rm = TRUE)
)
6.0.2.2 Model 4
# Model 4
model4 <- lm(TARGET_WINS~. -BATTING_H- BATTING_2B -BATTING_3B- BATTING_HR, data= moneyball_train)
summary(model4)
##
## Call:
## lm(formula = TARGET_WINS ~ . - BATTING_H - BATTING_2B - BATTING_3B -
## BATTING_HR, data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.334 -6.834 -0.136 6.517 29.480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.857266 8.110353 7.380 3.23e-13 ***
## BATTING_BB 0.006719 0.039339 0.171 0.864410
## BATTING_SO 0.006949 0.022410 0.310 0.756561
## BASERUN_SB 0.035119 0.010675 3.290 0.001036 **
## BASERUN_CS 0.068018 0.022780 2.986 0.002894 **
## PITCHING_H -0.002634 0.006751 -0.390 0.696514
## PITCHING_HR 0.116181 0.012748 9.113 < 2e-16 ***
## PITCHING_BB 0.030035 0.037698 0.797 0.425796
## PITCHING_SO -0.033549 0.021345 -1.572 0.116309
## FIELDING_E -0.127737 0.012193 -10.476 < 2e-16 ***
## FIELDING_DP -0.104855 0.016090 -6.517 1.12e-10 ***
## BATTING_1B 0.038734 0.010312 3.756 0.000182 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.86 on 1040 degrees of freedom
## (543 observations deleted due to missingness)
## Multiple R-squared: 0.3933, Adjusted R-squared: 0.3869
## F-statistic: 61.3 on 11 and 1040 DF, p-value: < 2.2e-16
vif(model4)
## BATTING_BB BATTING_SO BASERUN_SB BASERUN_CS PITCHING_H PITCHING_HR
## 107.539027 216.776484 2.415563 2.721623 14.163628 4.448142
## PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP BATTING_1B
## 144.662915 216.288753 2.187153 1.133447 7.973818
# Make predictions
predictions <- model4 %>% predict(moneyball_test)
# Model performance
data.frame(
RMSE = RMSE(predictions, moneyball_test$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions,moneyball_test$TARGET_WINS,na.rm = TRUE)
)
6.0.2.3 Model 5
model5 <- lm(TARGET_WINS~. -PITCHING_SO -PITCHING_BB -BATTING_H- BATTING_2B -BATTING_3B- BATTING_HR, data= moneyball_train)
summary(model5)
##
## Call:
## lm(formula = TARGET_WINS ~ . - PITCHING_SO - PITCHING_BB - BATTING_H -
## BATTING_2B - BATTING_3B - BATTING_HR, data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.408 -6.629 -0.164 6.503 29.704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.129049 8.109072 7.415 2.51e-13 ***
## BATTING_BB 0.038506 0.004083 9.430 < 2e-16 ***
## BATTING_SO -0.027830 0.002911 -9.562 < 2e-16 ***
## BASERUN_SB 0.036013 0.010592 3.400 0.0007 ***
## BASERUN_CS 0.066311 0.022725 2.918 0.0036 **
## PITCHING_H -0.010813 0.002702 -4.002 6.71e-05 ***
## PITCHING_HR 0.123928 0.010677 11.607 < 2e-16 ***
## FIELDING_E -0.128182 0.012162 -10.540 < 2e-16 ***
## FIELDING_DP -0.105752 0.016091 -6.572 7.82e-11 ***
## BATTING_1B 0.049404 0.006386 7.737 2.40e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.87 on 1042 degrees of freedom
## (543 observations deleted due to missingness)
## Multiple R-squared: 0.3909, Adjusted R-squared: 0.3857
## F-statistic: 74.32 on 9 and 1042 DF, p-value: < 2.2e-16
vif(model5)
## BATTING_BB BATTING_SO BASERUN_SB BASERUN_CS PITCHING_H PITCHING_HR
## 1.156266 3.649407 2.373748 2.703075 2.263550 3.113814
## FIELDING_E FIELDING_DP BATTING_1B
## 2.171454 1.131320 3.051488
predictions <- model5 %>% predict(moneyball_test)
# Model performance
data.frame(
RMSE = RMSE(predictions, moneyball_test$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions,moneyball_test$TARGET_WINS,na.rm = TRUE)
)
6.0.2.4 Step back
VIF result suggest that all the predictors in the model step_back have no multicolinearirty exist in them.
# model5 <- lm(TARGET_WINS~. -PITCHING_SO -PITCHING_BB -BATTING_H- BATTING_2B -BATTING_3B- BATTING_HR, data= moneyball_train)
summary(step_back)
##
## Call:
## lm(formula = poly_call[2], data = moneyball_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.0741 -6.5189 -0.0304 6.5548 28.5287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.226582 7.718003 7.674 3.83e-14 ***
## BATTING_1B 0.021961 0.006883 3.191 0.001462 **
## BATTING_2B -0.052339 0.008634 -6.062 1.88e-09 ***
## BATTING_3B 0.195353 0.024739 7.897 7.25e-15 ***
## BATTING_HR 0.123437 0.009440 13.077 < 2e-16 ***
## BATTING_BB 0.039462 0.003927 10.048 < 2e-16 ***
## BASERUN_SB 0.036916 0.010210 3.616 0.000314 ***
## BASERUN_CS 0.051264 0.021908 2.340 0.019475 *
## PITCHING_H 0.011846 0.002851 4.155 3.52e-05 ***
## PITCHING_SO -0.020636 0.002747 -7.513 1.25e-13 ***
## FIELDING_E -0.162363 0.012228 -13.278 < 2e-16 ***
## FIELDING_DP -0.106435 0.015427 -6.899 9.07e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.458 on 1040 degrees of freedom
## (543 observations deleted due to missingness)
## Multiple R-squared: 0.4418, Adjusted R-squared: 0.4359
## F-statistic: 74.83 on 11 and 1040 DF, p-value: < 2.2e-16
vif(step_back)
## BATTING_1B BATTING_2B BATTING_3B BATTING_HR BATTING_BB BASERUN_SB
## 3.860683 1.533907 2.592355 2.434721 1.164947 2.401669
## BASERUN_CS PITCHING_H PITCHING_SO FIELDING_E FIELDING_DP
## 2.736003 2.744801 3.892807 2.390615 1.132495
predictions <- step_back %>% predict(moneyball_test)
# Model performance
data.frame(
RMSE = RMSE(predictions, moneyball_test$TARGET_WINS,na.rm = TRUE),
R2 = R2(predictions,moneyball_test$TARGET_WINS,na.rm = TRUE)
)
Lets only consider Model with beter RMSE and R2 and check it with AIC test:
| Model Name | RMSE | R^2 |
|---|---|---|
| model1 | 9.80421 | 0.42556 |
| model2 | 10.2591 | 0.38835 |
| model3 | 10.0631 | 0.40604 |
| model4 | 9.92225 | 0.41098 |
| model5 | 9.99109 | 0.40295 |
| Step Back | 9.77083 | 0.428734 |
bbmle::AICctab(step_back,model4,model5,delta=TRUE, weights=TRUE)
## dAICc df weight
## step_back 0.0 13 1
## model4 87.6 13 <0.001
## model5 87.6 11 <0.001
In Both test Model1 is doing well, but since its not a parsomonious model we decided to check among model4 and model5 and step_back. Which is a parsomonious model, with no multicolnearity among the predictors. We also note how multicolinearity in models were impacting its effect on overall perfromcne of the model.
Selected Model = step_back
6.1 Run the step_backward model on Eval data.
model <- lm(BATTING_H~., data=med)
# StepBack Model
#step_backward_model <- MASS::stepAIC(model, direction="backward", trace = F)
#poly_call <- summary(step_backward_model)$call
#step_backward_model <- lm(poly_call[2], moneyball_train)
#summary(step_backward_model)
step_backward_model <- step (model, direction = "backward")
## Start: AIC=-9677.33
## BATTING_H ~ BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + BASERUN_CS + PITCHING_H + PITCHING_HR +
## PITCHING_BB + PITCHING_SO + FIELDING_E + FIELDING_DP + BATTING_1B
##
## Df Sum of Sq RSS AIC
## - BASERUN_CS 1 0.0 0.0 -9680.5
## - PITCHING_BB 1 0.0 0.0 -9679.8
## - FIELDING_E 1 0.0 0.0 -9679.4
## - BATTING_BB 1 0.0 0.0 -9679.3
## - FIELDING_DP 1 0.0 0.0 -9679.1
## - PITCHING_H 1 0.0 0.0 -9678.8
## - BASERUN_SB 1 0.0 0.0 -9678.5
## - PITCHING_HR 1 0.0 0.0 -9677.7
## <none> 0.0 -9677.3
## - BATTING_SO 1 0.0 0.0 -9674.7
## - PITCHING_SO 1 0.0 0.0 -9673.6
## - BATTING_HR 1 196.1 196.1 52.3
## - BATTING_3B 1 4607.5 4607.5 588.9
## - BATTING_2B 1 4715.2 4715.2 592.9
## - BATTING_1B 1 5029.8 5029.8 603.8
##
## Step: AIC=-9680.52
## BATTING_H ~ BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO + FIELDING_E + FIELDING_DP + BATTING_1B
##
## Df Sum of Sq RSS AIC
## - PITCHING_BB 1 0.0 0.0 -9682.3
## - FIELDING_E 1 0.0 0.0 -9682.3
## - FIELDING_DP 1 0.0 0.0 -9681.8
## - PITCHING_H 1 0.0 0.0 -9681.3
## - BATTING_BB 1 0.0 0.0 -9681.2
## <none> 0.0 -9680.5
## - BASERUN_SB 1 0.0 0.0 -9680.4
## - PITCHING_HR 1 0.0 0.0 -9679.3
## - PITCHING_SO 1 0.0 0.0 -9676.4
## - BATTING_SO 1 0.0 0.0 -9671.8
## - BATTING_HR 1 196.7 196.7 50.8
## - BATTING_3B 1 4616.4 4616.4 587.3
## - BATTING_2B 1 4778.8 4778.8 593.1
## - BATTING_1B 1 5067.4 5067.4 603.1
##
## Step: AIC=-9682.32
## BATTING_H ~ BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_SO +
## FIELDING_E + FIELDING_DP + BATTING_1B
##
## Df Sum of Sq RSS AIC
## - FIELDING_E 1 0 0 -9684.4
## - FIELDING_DP 1 0 0 -9683.8
## <none> 0 -9682.3
## - BATTING_BB 1 0 0 -9682.2
## - BASERUN_SB 1 0 0 -9682.2
## - PITCHING_HR 1 0 0 -9681.4
## - PITCHING_H 1 0 0 -9680.3
## - PITCHING_SO 1 0 0 -9678.1
## - BATTING_SO 1 0 0 -9673.6
## - BATTING_HR 1 200 200 51.6
## - BATTING_3B 1 14322 14322 777.7
## - BATTING_2B 1 25270 25270 874.3
## - BATTING_1B 1 31677 31677 912.7
##
## Step: AIC=-9684.37
## BATTING_H ~ BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_SO +
## FIELDING_DP + BATTING_1B
##
## Df Sum of Sq RSS AIC
## - FIELDING_DP 1 0 0 -9686.3
## <none> 0 -9684.4
## - BATTING_BB 1 0 0 -9684.3
## - PITCHING_H 1 0 0 -9684.2
## - PITCHING_HR 1 0 0 -9684.0
## - BASERUN_SB 1 0 0 -9683.6
## - PITCHING_SO 1 0 0 -9679.8
## - BATTING_SO 1 0 0 -9675.9
## - BATTING_HR 1 203 203 52.6
## - BATTING_3B 1 15294 15294 786.9
## - BATTING_2B 1 25511 25511 873.9
## - BATTING_1B 1 31824 31824 911.5
##
## Step: AIC=-9686.3
## BATTING_H ~ BATTING_2B + BATTING_3B + BATTING_HR + BATTING_BB +
## BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_SO +
## BATTING_1B
##
## Df Sum of Sq RSS AIC
## <none> 0 -9686.3
## - BASERUN_SB 1 0 0 -9686.3
## - PITCHING_H 1 0 0 -9685.3
## - BATTING_BB 1 0 0 -9685.0
## - PITCHING_HR 1 0 0 -9684.5
## - PITCHING_SO 1 0 0 -9681.5
## - BATTING_SO 1 0 0 -9676.5
## - BATTING_HR 1 204 204 50.9
## - BATTING_3B 1 15432 15432 786.4
## - BATTING_2B 1 25885 25885 874.4
## - BATTING_1B 1 32131 32131 911.1
summary(step_backward_model)
##
## Call:
## lm(formula = BATTING_H ~ BATTING_2B + BATTING_3B + BATTING_HR +
## BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_HR +
## PITCHING_SO + BATTING_1B, data = med)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.866e-12 -5.020e-14 2.600e-14 1.005e-13 5.880e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.719e-13 7.612e-13 -1.145e+00 0.25374
## BATTING_2B 1.000e+00 2.554e-15 3.915e+14 < 2e-16 ***
## BATTING_3B 1.000e+00 3.308e-15 3.023e+14 < 2e-16 ***
## BATTING_HR 1.000e+00 2.878e-14 3.475e+13 < 2e-16 ***
## BATTING_BB -1.870e-17 4.134e-16 -4.500e-02 0.96398
## BATTING_SO -1.405e-14 5.314e-15 -2.643e+00 0.00904 **
## BASERUN_SB 6.607e-16 6.723e-16 9.830e-01 0.32722
## PITCHING_H -2.819e-15 2.185e-15 -1.290e+00 0.19879
## PITCHING_HR -4.645e-14 2.859e-14 -1.625e+00 0.10613
## PITCHING_SO 1.311e-14 5.172e-15 2.535e+00 0.01221 *
## BATTING_1B 1.000e+00 2.292e-15 4.362e+14 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.109e-13 on 159 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.289e+30 on 10 and 159 DF, p-value: < 2.2e-16
From the three models, model3 is a more parsimonious model. There is no significant difference in R2, Adjusted R2 and RMSE even when i did the treatment for multi-collinearity.
6.1.1 Model 1 : Kitchen Sink Model
plot(model1)
moneyball_test$kitchen_sink <- predict(model1, moneyball_test)
moneyball_test <- moneyball_test %>%
mutate(kitchen_sink_error = TARGET_WINS - kitchen_sink)
ggplot(moneyball_test, aes(kitchen_sink_error)) +
geom_histogram(bins = 50) +
annotate("text",x=0,y=10,
label = paste("RMSE = ",
round(sqrt(mean(moneyball_test$kitchen_sink_error^2)),2)
),
color="white"
)
summary(moneyball_test$kitchen_sink_error)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -28.3735 -6.9033 -0.1124 -0.0408 6.4889 27.6495 247
6.1.2 Model 2 : Simple Model
plot(model2)
moneyball_test$simple <- predict(model2, moneyball_test)
moneyball_test <- moneyball_test %>%
mutate(simple_error = TARGET_WINS - simple)
ggplot(moneyball_test, aes(simple_error)) +
geom_histogram(bins = 50) +
annotate("text",x=0,y=10,
label = paste("RMSE = ",
round(sqrt(mean(moneyball_test$simple_error^2)),2)
),
color="white"
)
summary(moneyball_test$simple_error)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -27.2876 -7.6292 0.2432 -0.1372 6.5731 29.6379 143
6.1.3 Model 3 : Higher Order Stepwise Regression
plot(step_back)
moneyball_test$step_back <- round(predict(step_back, moneyball_test), 0)
moneyball_test <- moneyball_test %>%
mutate(step_back_error = TARGET_WINS - step_back)
moneyball_test %>%
filter(step_back_error > -100) %>%
ggplot(., aes(step_back_error)) +
geom_histogram(bins = 50) +
labs(caption = "Outlier removed")
summary(moneyball_test$step_back_error)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -28.00000 -7.00000 0.00000 -0.04147 6.75000 28.00000 247
7 CONCLUSION
This report covers an attempt to build a model to predict number of wins of a baseball team in a season based on several offensive and deffensive statistics. Resulting model explained about 36% of variability in the target variable and included most of the provided explanatory variables. Some potentially helpful variables were not included in the data set. For instance, number of At Bats can be used to calculate on-base percentage which may correlate strongly with winning percentage. The model can be revised with additional variables or further analysis.
moneyball_test %>%
select(kitchen_sink_error, simple_error, step_back_error) %>%
summary() %>%
kable() %>%
kable_styling()
| kitchen_sink_error | simple_error | step_back_error | |
|---|---|---|---|
| Min. :-28.3735 | Min. :-27.2876 | Min. :-28.00000 | |
| 1st Qu.: -6.9033 | 1st Qu.: -7.6292 | 1st Qu.: -7.00000 | |
| Median : -0.1124 | Median : 0.2432 | Median : 0.00000 | |
| Mean : -0.0408 | Mean : -0.1372 | Mean : -0.04147 | |
| 3rd Qu.: 6.4889 | 3rd Qu.: 6.5731 | 3rd Qu.: 6.75000 | |
| Max. : 27.6495 | Max. : 29.6379 | Max. : 28.00000 | |
| NA’s :247 | NA’s :143 | NA’s :247 |